Temporal-Difference Methods

05. Quiz: Sarsa

Suppose an agent is learning to navigate the gridworld described earlier in the lesson.

The agent uses Sarsa to search for the optimal policy, with step size $\alpha = 0.1$.

At the end of the 99th episode, the Q-table has the following values:

Say that at the beginning of the 100th episode, the agent starts in state 1 and selects action right. As a result, it receives reward -1, and the next state is state 2.

Then, at the next timestep, the agent selects action right.

In the previous video, you learned that at this point in time, the agent updates the Q-table.

Which entry in the Q-table is updated?

The entry corresponding to state 1 and action left.

The entry corresponding to state 2 and action left.

The entry corresponding to state 1 and action right.

The entry corresponding to state 2 and action right.

SOLUTION:

The entry corresponding to **state 1** and **action right**.
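
For reference, the reasoning behind this answer can be read off the standard Sarsa update rule. The form below includes a discount rate $\gamma$, a symbol this page does not define, so treat it as an assumption about the lesson's notation:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha \left( R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)
```

In this quiz, $S_t$ is state 1, $A_t$ is right, $R_{t+1} = -1$, $S_{t+1}$ is state 2, and $A_{t+1}$ is right. Only the entry on the left-hand side, (state 1, right), is overwritten. The entry for (state 2, right) is read to form the target but is not changed.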

What is the new value in the Q-table corresponding to the state-action pair you selected in the answer to the question above?

(Suppose that when selecting the actions for the first two timesteps in the 100th episode, the agent was following the epsilon-greedy policy with respect to the Q-table, with $\epsilon = 0.4$.)

6.1

6.16

6.2

SOLUTION:

6.1
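
The arithmetic behind this answer can be sketched in a few lines of Python. The Q-table is not reproduced on this page, so the two entries below are hypothetical values chosen only for illustration, and the discount rate `gamma = 1.0` is likewise an assumption. The function name `sarsa_update` and the dictionary layout are also illustrative, not part of the course code.

```python
def sarsa_update(q, state, action, reward, next_state, next_action, alpha, gamma=1.0):
    """Apply one Sarsa update in place and return the new Q(state, action)."""
    # The target uses the action that was actually selected in the next state.
    td_target = reward + gamma * q[(next_state, next_action)]
    td_error = td_target - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Hypothetical Q-table entries (assumed for illustration; the real table
# from the quiz is not shown on this page).
q_table = {(1, "right"): 6.0, (2, "right"): 8.0}

new_value = sarsa_update(
    q_table,
    state=1, action="right",
    reward=-1,
    next_state=2, next_action="right",
    alpha=0.1,
)
print(f"{new_value:.2f}")  # prints 6.10 for the assumed entries above
```

Note that epsilon never appears in the update. Sarsa plugs in the action the agent actually selected in state 2, so the value of epsilon affects which action is chosen but not how the update is computed once that action is known.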
